86 research outputs found

    AutoAccel: Automated Accelerator Generation and Optimization with Composable, Parallel and Pipeline Architecture

    Full text link
    CPU-FPGA heterogeneous architectures are attracting ever-increasing attention in an attempt to advance computational capabilities and energy efficiency in today's datacenters. These architectures provide programmers with the ability to reprogram the FPGAs for flexible acceleration of many workloads. Nonetheless, this advantage is often overshadowed by the poor programmability of FPGAs whose programming is conventionally a RTL design practice. Although recent advances in high-level synthesis (HLS) significantly improve the FPGA programmability, it still leaves programmers facing the challenge of identifying the optimal design configuration in a tremendous design space. This paper aims to address this challenge and pave the path from software programs towards high-quality FPGA accelerators. Specifically, we first propose the composable, parallel and pipeline (CPP) microarchitecture as a template of accelerator designs. Such a well-defined template is able to support efficient accelerator designs for a broad class of computation kernels, and more importantly, drastically reduce the design space. Also, we introduce an analytical model to capture the performance and resource trade-offs among different design configurations of the CPP microarchitecture, which lays the foundation for fast design space exploration. On top of the CPP microarchitecture and its analytical model, we develop the AutoAccel framework to make the entire accelerator generation automated. AutoAccel accepts a software program as an input and performs a series of code transformations based on the result of the analytical-model-based design space exploration to construct the desired CPP microarchitecture. Our experiments show that the AutoAccel-generated accelerators outperform their corresponding software implementations by an average of 72x for a broad class of computation kernels

    Hidet: Task-Mapping Programming Paradigm for Deep Learning Tensor Programs

    Full text link
    As deep learning models nowadays are widely adopted by both cloud services and edge devices, reducing the latency of deep learning model inferences becomes crucial to provide efficient model serving. However, it is challenging to develop efficient tensor programs for deep learning operators due to the high complexity of modern accelerators and the rapidly growing number of operators. Deep learning compilers, such as Apache TVM, adopt declarative scheduling primitives to lower the bar of developing tensor programs. However, we show that this approach is insufficient to cover state-of-the-art tensor program optimizations. In this paper, we propose to embed the scheduling process into tensor programs and use dedicated mappings, called task mappings, to define the computation assignment and ordering. This new approach greatly enriches the expressible optimizations by allowing developers to manipulate tensor programs at a much finer granularity. We call the proposed method the task-mapping programming paradigm. In addition, we propose a new post-scheduling fusion optimization that allows developers to focus on scheduling every single operator and automates the fusion after scheduling. It greatly reduces the engineering efforts for operator fusion. Our proposed paradigm also constructs an efficient hardware-centric schedule space, which is agnostic to the program input size and greatly reduces the tuning time. With the proposed paradigm, we implement a deep learning compiler Hidet. Extensive experiments on modern convolution and transformer models show that Hidet outperforms state-of-the-art DNN inference framework, ONNX Runtime, and compiler, TVM equipped with scheduler AutoTVM and Ansor, by up to 1.48x (1.22x on average). It also reduces the tuning time by 20x and 11x compared with AutoTVM and Ansor, respectively. We open-sourced hidet at https://www.github.com/hidet-org/hidet.Comment: 15 pages, 22 figures, 1 tabl

    Decoupled Model Schedule for Deep Learning Training

    Full text link
    Recent years have seen an increase in the development of large deep learning (DL) models, which makes training efficiency crucial. Common practice is struggling with the trade-off between usability and performance. On one hand, DL frameworks such as PyTorch use dynamic graphs to facilitate model developers at a price of sub-optimal model training performance. On the other hand, practitioners propose various approaches to improving the training efficiency by sacrificing some of the flexibility, ranging from making the graph static for more thorough optimization (e.g., XLA) to customizing optimization towards large-scale distributed training (e.g., DeepSpeed and Megatron-LM). In this paper, we aim to address the tension between usability and training efficiency through separation of concerns. Inspired by DL compilers that decouple the platform-specific optimizations of a tensor-level operator from its arithmetic definition, this paper proposes a schedule language to decouple model execution from definition. Specifically, the schedule works on a PyTorch model and uses a set of schedule primitives to convert the model for common model training optimizations such as high-performance kernels, effective 3D parallelism, and efficient activation checkpointing. Compared to existing optimization solutions, we optimize the model as-needed through high-level primitives, and thus preserving programmability and debuggability for users to a large extent. Our evaluation results show that by scheduling the existing hand-crafted optimizations in a systematic way, we are able to improve training throughput by up to 3.35x on a single machine with 8 NVIDIA V100 GPUs, and by up to 1.32x on multiple machines with up to 64 GPUs, when compared to the out-of-the-box performance of DeepSpeed and Megatron-LM

    TensorIR: An Abstraction for Automatic Tensorized Program Optimization

    Full text link
    Deploying deep learning models on various devices has become an important topic. The wave of hardware specialization brings a diverse set of acceleration primitives for multi-dimensional tensor computations. These new acceleration primitives, along with the emerging machine learning models, bring tremendous engineering challenges. In this paper, we present TensorIR, a compiler abstraction for optimizing programs with these tensor computation primitives. TensorIR generalizes the loop nest representation used in existing machine learning compilers to bring tensor computation as the first-class citizen. Finally, we build an end-to-end framework on top of our abstraction to automatically optimize deep learning models for given tensor computation primitives. Experimental results show that TensorIR compilation automatically uses the tensor computation primitives for given hardware backends and delivers performance that is competitive to state-of-art hand-optimized systems across platforms.Comment: Accepted to ASPLOS 202

    Ansor : Generating High-Performance Tensor Programs for Deep Learning

    Full text link
    High-performance tensor programs are crucial to guarantee efficient execution of deep neural networks. However, obtaining performant tensor programs for different operators on various hardware platforms is notoriously challenging. Currently, deep learning systems rely on vendor-provided kernel libraries or various search strategies to get performant tensor programs. These approaches either require significant engineering effort to develop platform-specific optimization code or fall short of finding high-performance programs due to restricted search space and ineffective exploration strategy. We present Ansor, a tensor program generation framework for deep learning applications. Compared with existing search strategies, Ansor explores many more optimization combinations by sampling programs from a hierarchical representation of the search space. Ansor then fine-tunes the sampled programs with evolutionary search and a learned cost model to identify the best programs. Ansor can find high-performance programs that are outside the search space of existing state-of-the-art approaches. In addition, Ansor utilizes a task scheduler to simultaneously optimize multiple subgraphs in deep neural networks. We show that Ansor improves the execution performance of deep neural networks relative to the state-of-the-art on the Intel CPU, ARM CPU, and NVIDIA GPU by up to 3.8×3.8\times, 2.6×2.6\times, and 1.7×1.7\times, respectively.Comment: Published in OSDI 202

    CXCR5<sup>+</sup> follicular cytotoxic T cells control viral infection in B cell follicles

    Get PDF
    During unresolved infections, some viruses escape immunological control and establish a persistant reservoir in certain cell types, such as human immunodeficiency virus (HIV), which persists in follicular helper T cells (TFH cells), and Epstein-Barr virus (EBV), which persists in B cells. Here we identified a specialized group of cytotoxic T cells (TC cells) that expressed the chemokine receptor CXCR5, selectively entered B cell follicles and eradicated infected TFH cells and B cells. The differentiation of these cells, which we have called 'follicular cytotoxic T cells' (TFC cells), required the transcription factors Bcl6, E2A and TCF-1 but was inhibited by the transcriptional regulators Blimp1, Id2 and Id3. Blimp1 and E2A directly regulated Cxcr5 expression and, together with Bcl6 and TCF-1, formed a transcriptional circuit that guided TFC cell development. The identification of TFC cells has far-reaching implications for the development of strategies to control infections that target B cells and TFH cells and to treat B cell–derived malignancies

    Prevalence, associated factors and outcomes of pressure injuries in adult intensive care unit patients: the DecubICUs study

    Get PDF
    Funder: European Society of Intensive Care Medicine; doi: http://dx.doi.org/10.13039/501100013347Funder: Flemish Society for Critical Care NursesAbstract: Purpose: Intensive care unit (ICU) patients are particularly susceptible to developing pressure injuries. Epidemiologic data is however unavailable. We aimed to provide an international picture of the extent of pressure injuries and factors associated with ICU-acquired pressure injuries in adult ICU patients. Methods: International 1-day point-prevalence study; follow-up for outcome assessment until hospital discharge (maximum 12 weeks). Factors associated with ICU-acquired pressure injury and hospital mortality were assessed by generalised linear mixed-effects regression analysis. Results: Data from 13,254 patients in 1117 ICUs (90 countries) revealed 6747 pressure injuries; 3997 (59.2%) were ICU-acquired. Overall prevalence was 26.6% (95% confidence interval [CI] 25.9–27.3). ICU-acquired prevalence was 16.2% (95% CI 15.6–16.8). Sacrum (37%) and heels (19.5%) were most affected. Factors independently associated with ICU-acquired pressure injuries were older age, male sex, being underweight, emergency surgery, higher Simplified Acute Physiology Score II, Braden score 3 days, comorbidities (chronic obstructive pulmonary disease, immunodeficiency), organ support (renal replacement, mechanical ventilation on ICU admission), and being in a low or lower-middle income-economy. Gradually increasing associations with mortality were identified for increasing severity of pressure injury: stage I (odds ratio [OR] 1.5; 95% CI 1.2–1.8), stage II (OR 1.6; 95% CI 1.4–1.9), and stage III or worse (OR 2.8; 95% CI 2.3–3.3). Conclusion: Pressure injuries are common in adult ICU patients. ICU-acquired pressure injuries are associated with mainly intrinsic factors and mortality. Optimal care standards, increased awareness, appropriate resource allocation, and further research into optimal prevention are pivotal to tackle this important patient safety threat
    corecore